Heavy-Tailed Processes for Selective Shrinkage



Fabian L. Wauthier Michael I. Jordan 

Computer Science Division Computer Science Division 

University of California, Berkeley University of California, Berkeley 
flw@cs.berkeley.edu jordan@cs.berkeley.edu



Abstract 

Heavy-tailed distributions are frequently used to enhance the robustness of regression and 
classification methods to outliers in output space. Often, however, we are confronted with 
"outliers" in input space, which are isolated observations in sparsely populated regions. We 
show that heavy-tailed stochastic processes (which we construct from Gaussian processes via 
a copula), can be used to improve robustness of regression and classification estimators to 
such outliers by selectively shrinking them more strongly in sparse regions than in dense 
regions. We carry out a theoretical analysis to show that selective shrinkage occurs, pro- 
vided the marginals of the heavy-tailed process have sufficiently heavy tails. The analysis is 
complemented by experiments on biological data which indicate significant improvements of 
estimates in sparse regions while producing competitive results in dense regions. 

Gaussian process classifiers (GPCs) [12] provide a Bayesian approach to nonparametric classification
with the key advantage of producing predictive class probabilities. Unfortunately, when
training data are unevenly sampled in input space, GPCs tend to overfit in the sparsely populated 
regions. Our work is motivated by an application to protein folding where this presents a major 
difficulty. In particular, while Nature provides samples of protein configurations near the global 
minima of free energy functions, protein-folding algorithms necessarily explore regions far from
the minimum. If the estimate of free energy is poor in those sparsely-sampled regions then the 
algorithm has a poor guide towards the minimum. More generally this problem can be viewed as 
one of "covariate shift," where the sampling pattern differs in the training and testing phase. 

In this paper we investigate a GPC-based approach that addresses overfitting by shrinking predic- 
tive probabilities towards conservative values. For an unevenly sampled input space it is natural 
to consider a selective shrinkage strategy: we wish to shrink probability estimates more strongly in 
sparse regions than in dense regions. To this end several approaches could be considered. If sparse 
regions can be readily identified, selective shrinkage could be induced by tailoring the Gaussian 
process (GP) kernel to reflect that information. In the absence of such knowledge, Goldberg and 
Williams ^ showed that Gaussian process regression (GPR) can be augmented with a GP on the 
log noise level. More recent work has focused on partitioning input space into discrete regions 
and defining different kernel functions on each. Treed Gaussian process regression ^ and Treed 
Gaussian process classification [1] represent advanced variations of this theme that define a prior
distribution over partitions and their respective kernel hyperparameters. Another line of research 
which could be adapted to this problem posits that the covariate space is a nonlinear deformation 
of another space on which a Gaussian process prior is placed [3, 13]. Instead of directly modifying
the kernel matrix, the observed non-uniformity of measurements is interpreted as being caused by 
the spatial deformation. A difficulty with all these approaches is that posterior inference is based 
on MCMC, which can be overly slow for the large-scale problems that we aim to address. 

This paper presents an alternative approach to selective shrinkage which replaces the Gaussian 
process underlying GPC with a stochastic process that has heavy-tailed marginals (e.g., Laplace, 
hyperbolic secant, or Student-t). While heavy-tailed marginals are generally viewed as providing
robustness to outliers in the output space (i.e., the response space), the selective shrinkage notion
can be viewed as a form of robustness to outliers in the input space (i.e., the covariate space). 
Indeed, selective shrinkage means the data points that are far from other data points in the input 
space are regularized more strongly. We provide a theoretical analysis and empirical results to 
show that inference based on stochastic processes with heavy-tailed marginals yields precisely this 
kind of shrinkage. 

The paper is structured as follows: Section 1 provides background on GPCs. We present a
construction of heavy-tailed stochastic processes in Section 2 and show that inference reduces to
standard computations in a Gaussian process. An analysis of our approach is presented in Section 3
and details on inference algorithms are presented in Section 4. Experiments on biological data
in Section 5 demonstrate that heavy-tailed process classification substantially outperforms GPC
in sparse regions while performing competitively in dense regions. The paper concludes with an
overview of related research and final remarks in Sections 6 and 7.



1 Gaussian process classification and shrinkage 

A Gaussian process (GP) [12] is a prior on functions $z : \mathcal{X} \to \mathbb{R}$ defined through a mean function
(usually identically zero) and a symmetric positive semidefinite kernel $k(\cdot,\cdot)$. For a finite set of
locations $X = (x_1, \ldots, x_n)$ we write $z(X) \sim p(z(X)) = \mathcal{N}(0, K(X,X))$ as a random variable
distributed according to the GP with finite-dimensional kernel matrix $[K(X,X)]_{i,j} = k(x_i, x_j)$.
Let $y$ denote a vector of binary class labels associated with measurement locations $X$.¹ For
Gaussian process classification (GPC) [12] the probability that a test point $x_*$ is labeled as class
$y_* = +1$, given training data $(X, y)$, is computed as

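$$p(y_* = +1 \mid X, y, x_*) = \mathbb{E}_{p(z(x_*) \mid X, y, x_*)}\left[\frac{1}{1 + \exp\{-z(x_*)\}}\right] \tag{1}$$

$$p(z(x_*) \mid X, y, x_*) = \int p(z(x_*) \mid X, z(X), x_*)\, p(z(X) \mid X, y)\, dz(X).$$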

The predictive distribution $p(z(x_*) \mid X, y, x_*)$ represents a regression on $z(x_*)$ with a complicated
observation model $y \mid z$. From Eq. (1) we observe that we could selectively shrink the prediction
$p(y_* = +1 \mid X, y, x_*)$ towards a conservative value $1/2$ by selectively shrinking $p(z(x_*) \mid X, y, x_*)$
closer to a point mass at zero. Our paper takes this intuition and shows that such selective 
shrinkage can be achieved by replacing the GP underlying GPC with a stochastic process that has 
sufficiently heavy tails. 



2 Heavy-tailed stochastic processes via the Gaussian copula 

In this section we construct the heavy-tailed stochastic process by transforming a GP. As with the
GP, we will treat the new process as a prior on functions. Suppose that $\mathrm{diag}(K(X,X)) = \sigma^2 \mathbf{1}$.
We define the heavy-tailed process $f(X)$ with marginal c.d.f. $G_b$ as

$$z(X) \sim \mathcal{N}(0, K(X,X)) \tag{2}$$
$$u(X) = \Phi_{0,\sigma^2}(z(X)) \tag{3}$$
$$f(X) = G_b^{-1}(u(X)) = G_b^{-1}(\Phi_{0,\sigma^2}(z(X))).$$

Here the function $\Phi_{0,\sigma^2}(\cdot)$ is the c.d.f. of a centered Gaussian with variance $\sigma^2$. Presently, we
only consider the case when $G_b$ is the (continuous) c.d.f. of a heavy-tailed density $g_b$ with scale
parameter $b$ that is symmetric about the origin. Examples include the Laplace, hyperbolic secant

¹To improve the clarity of exposition, we only deal with binary classification for now. A full
multiclass classification model will be used for our experiments.
and Student-t distributions. We note that other authors have considered asymmetric or even
discrete distributions [2, 11, 16], while Snelson et al. [15] use arbitrary monotonic transformations
in place of $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$. The process $u(X)$ has the density of a Gaussian copula [10, 16] and is
critical in transferring the correlation structure encoded by $K(X,X)$ from $z(X)$ to $f(X)$. If we
define $z(f(X)) = \Phi_{0,\sigma^2}^{-1}(G_b(f(X)))$, it is well known [3 [i [TTl IISI HI that the density of $f(X)$
takes the form



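$$p(f(X)) = \frac{\prod_{i=1}^{n} g_b(f(x_i))}{|K(X,X)/\sigma^2|^{1/2}} \exp\left\{-\frac{1}{2}\, z(f(X))^\top \left[K(X,X)^{-1} - \frac{I}{\sigma^2}\right] z(f(X))\right\}. \tag{4}$$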



Observe that if $K(X,X) = \sigma^2 I$ then $p(f(X)) = \prod_{i=1}^{n} g_b(f(x_i))$. A prior Gaussian process with
independent components induces a heavy-tailed process with independent components. Also note
that if $G_b$ were chosen to be Gaussian, we would recover the Gaussian process. The predictive
distribution $p(f(x_*) \mid X, f(X), x_*)$ can be interpreted as a heavy-tailed process regression (HPR).
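The construction in Eqs. (2)-(3) can be sketched in a few lines. The following is a minimal illustration only, assuming Laplace marginals for $G_b$ and a squared-exponential kernel; the kernel choice, the function names `rbf_kernel` and `sample_heavy_tailed_process`, and the small Cholesky jitter are assumptions made for this example and are not taken from the paper.

```python
import numpy as np
from scipy.stats import norm, laplace


def rbf_kernel(x, sigma2=1.0, length=0.5):
    """Hypothetical squared-exponential kernel with diag(K) = sigma2."""
    d = x[:, None] - x[None, :]
    return sigma2 * np.exp(-0.5 * (d / length) ** 2)


def sample_heavy_tailed_process(K, sigma2, b, n_samples, rng):
    """Draw samples of z, u and f at the locations encoded by K."""
    n = K.shape[0]
    # Jitter is only a numerical safeguard for the Cholesky factorization.
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))
    z = L @ rng.standard_normal((n, n_samples))        # Eq. (2): z ~ N(0, K)
    u = norm.cdf(z, loc=0.0, scale=np.sqrt(sigma2))    # Eq. (3): Gaussian copula
    f = laplace.ppf(u, loc=0.0, scale=b)               # f = G_b^{-1}(u), Laplace
    return z, u, f


x = np.linspace(0.0, 1.0, 5)
K = rbf_kernel(x, sigma2=1.0)
z, u, f = sample_heavy_tailed_process(K, sigma2=1.0, b=2.0, n_samples=4,
                                      rng=np.random.default_rng(0))
```

Because the map from $z$ to $f$ is applied elementwise and is strictly increasing, the ordering and signs of the latent values are preserved; only the marginal scale changes.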

It is easy to see that the computation of this predictive distribution can be reduced to standard
computations in a Gaussian model by nonlinearly transforming observations $f(X)$ into $z$-space.
Specifically, the predictive distribution in $z$-space satisfies



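$$p(z(x_*) \mid X, f(X), x_*) = \mathcal{N}(\mu_*, \Sigma_*) \tag{5}$$
$$\mu_* = K(x_*, X)\, K(X,X)^{-1}\, z(f(X)) \tag{6}$$
$$\Sigma_* = K(x_*, x_*) - K(x_*, X)\, K(X,X)^{-1}\, K(X, x_*). \tag{7}$$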



The corresponding distribution in $f$-space follows by another change of variables. Having defined
the heavy-tailed stochastic process in general we now turn to analyze its shrinkage properties.
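As a rough sketch of Eqs. (5)-(7), the code below maps observations into $z$-space, applies the standard Gaussian conditioning formulas, and maps the predictive median back to $f$-space (quantiles carry over because the transformation is monotone). Laplace marginals, the squared-exponential kernel `rbf`, and the name `hpr_predict` are assumptions of this illustration.

```python
import numpy as np
from scipy.stats import norm, laplace


def rbf(a, b, length=0.3):
    """Hypothetical unit-variance squared-exponential kernel."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)


def hpr_predict(x_train, f_train, x_test, sigma2=1.0, b=2.0):
    """z-space predictive mean/variance (Eqs. 5-7) and f-space median."""
    s = np.sqrt(sigma2)
    K = sigma2 * rbf(x_train, x_train)
    k_star = sigma2 * rbf(x_test, x_train)
    # Transform observations into z-space: z(f) = Phi^{-1}(G_b(f)).
    z_train = norm.ppf(laplace.cdf(f_train, scale=b), scale=s)
    mu = k_star @ np.linalg.solve(K, z_train)                        # Eq. (6)
    var = sigma2 - np.einsum("ij,ji->i", k_star,
                             np.linalg.solve(K, k_star.T))          # Eq. (7)
    # Monotone change of variables: the median in z maps to the median in f.
    f_median = laplace.ppf(norm.cdf(mu, scale=s), scale=b)
    return mu, var, f_median


x_train = np.array([0.0, 0.5, 1.0])
f_train = np.array([1.5, -0.7, 3.0])
mu, var, f_med = hpr_predict(x_train, f_train, np.array([0.5, 0.25]))
```

At a test location that coincides with a training location, this noise-free sketch reproduces the observed value with (numerically) zero predictive variance.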



3 Selective shrinkage 

By "selective shrinkage" we mean that the degree of shrinkage applied to a collection of estimators
varies across estimators. As motivated in Section 1, we are specifically interested in selectively
shrinking posterior distributions near isolated locations more strongly than in dense regions. This
section shows that by changing the form of prior marginals (heavy-tailed instead of Gaussian) we
can induce stronger selective shrinkage than any GPR. Since HPR uses a GP in its construction,
which can induce (some) selective shrinkage on its own, care must be taken to investigate only the
additional benefits the transformation $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$ has on shrinkage. For this reason we assume
a particular GP prior which leads to a special type of shrinkage in GPR and then check how an
HPR model built on top of that GP changes the observed behavior.

In this section we provide an idealized analysis that allows us to compare the selective shrinkage
obtained by GPR and HPR. Note that we focus on regression in this section so that we can
obtain analytical results. We work with $n$ measurement locations, $X = (x_1, \ldots, x_n)$, whose index
set $\{1, \ldots, n\}$ can be partitioned into a "dense" set $D$ with $|D| = n - 1$ and a single "sparse"
index $s \notin D$. Assume that $x_d = x_{d'}$ $\forall d, d' \in D$ so that we may let (without loss of generality)
$K(x_d, x_{d'}) = 1$ $\forall d, d' \in D$. We also assert that $x_d \neq x_s$ $\forall d \in D$ and let $K(x_d, x_s) = K(x_s, x_d) = 0$
$\forall d \in D$. Assuming that $n > 2$ we fix the remaining entry $K(x_s, x_s) = \epsilon/(\epsilon + n - 2)$, for some
$\epsilon > 0$. We interpret $\epsilon$ as a noise variance and let $\tilde{K} = K + \epsilon I$. The set of locations $X$ idealizes
an uneven sampling of input space, consisting of a densely and a sparsely sampled region, as
represented by $D$ and $s$.

Denote any distributions computed under the GPR model by $p_{gp}(\cdot)$ and those computed in HPR
by $p_{hp}(\cdot)$. Using $K(X,X) = \tilde{K}$, define $z(X)$ as in Eq. (2). Let $y$ denote a vector of real-valued
measurements for a regression task. The posterior distribution of $z(x_i)$ given $y$, with $x_i \in X$, is
derived by standard Gaussian computations as

$$(z(x_i) \mid X, y) \sim \mathcal{N}(\mu_i, \sigma_i^2)$$
$$\mu_i = K(x_i, X)\, K(X,X)^{-1}\, y$$
$$\sigma_i^2 = K(x_i, x_i) - K(x_i, X)\, K(X,X)^{-1}\, K(X, x_i).$$

Figure 1: Illustration of $G_b^{-1}(\Phi_{0,\sigma^2}(x))$, for $\sigma^2 = 1.0$, with $G_b$ the c.d.f. of (a) the Laplace
distribution, (b) the hyperbolic secant distribution, (c) a Student-t inspired distribution, all with scale
parameter $b$. Each plot shows three samples (dotted, dashed, solid) for growing $b$. As $b$ increases
the distributions become heavy-tailed and the gradient of $G_b^{-1}(\Phi_{0,\sigma^2}(x))$ increases.



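For our choice of $K(X,X)$ one can show that $\sigma_d^2 = \sigma_s^2$ for $d \in D$. To ensure that the posterior
distributions agree at the two locations we require $\mu_d = \mu_s$, which holds if measurements $y$ satisfy

$$y \in \mathcal{Y}_{gp} = \left\{ y \;\middle|\; \left(K(x_d, X) - K(x_s, X)\right) K(X,X)^{-1} y = 0 \right\} = \left\{ y \;\middle|\; \sum_{d \in D} y_d = y_s \right\}.$$

A similar analysis can be carried out for the induced HPR model. By Eqs. (5)-(7) HPR inference
leads to identical distributions $p_{hp}(z(x_d) \mid X, y') = p_{hp}(z(x_s) \mid X, y')$ with $d \in D$ if measurements $y'$
in $f$-space satisfy

$$y' \in \mathcal{Y}_{hp} = \left\{ y' \;\middle|\; \left(K(x_d, X) - K(x_s, X)\right) K(X,X)^{-1}\, \Phi_{0,\sigma^2}^{-1}(G_b(y')) = 0 \right\}$$
$$\phantom{y' \in \mathcal{Y}_{hp}} = \left\{ y' = G_b^{-1}(\Phi_{0,\sigma^2}(y)) \;\middle|\; y \in \mathcal{Y}_{gp} \right\}.$$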
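The two claims about the idealized GPR model (equal posterior variances, and equal posterior means whenever the dense measurements sum to the sparse one) can be checked numerically. The sketch below reads the setup as follows, which is an interpretation rather than something stated explicitly: cross-covariances $K(x_i, X)$ use the noise-free entries of $K$, while $K(X,X)^{-1}$ uses $\tilde{K} = K + \epsilon I$. The helper names are hypothetical.

```python
import numpy as np


def idealized_kernel(n, eps):
    """Noise-free K: n-1 coincident 'dense' points and one isolated point."""
    K = np.zeros((n, n))
    K[: n - 1, : n - 1] = 1.0                 # K(x_d, x_d') = 1 in the dense set
    K[n - 1, n - 1] = eps / (eps + n - 2)     # K(x_s, x_s)
    return K                                  # cross terms K(x_d, x_s) stay 0


def gpr_posterior(K, eps, y):
    """Posterior means and variances of z(x_i) at all training locations."""
    n = K.shape[0]
    K_tilde = K + eps * np.eye(n)             # kernel plus noise variance
    mu = K @ np.linalg.solve(K_tilde, y)
    var = np.diag(K - K @ np.linalg.solve(K_tilde, K))
    return mu, var


n, eps = 5, 0.3
y_dense = np.array([0.2, 0.5, 0.1, 0.4])
y = np.append(y_dense, y_dense.sum())         # y_s equals the sum of the y_d
mu, var = gpr_posterior(idealized_kernel(n, eps), eps, y)
```

Under this reading every posterior variance equals $\epsilon/(\epsilon + n - 1)$ and every posterior mean equals $y_s/(\epsilon + n - 1)$, so the single isolated observation must be as large as the sum of the clustered ones to produce the same posterior.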

To compare the shrinkage properties of GPR and HPR we analyze select pairs of measurements in
$\mathcal{Y}_{gp}$ and $\mathcal{Y}_{hp}$. The derivation requires that $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$ is strongly concave on $(-\infty, 0]$, strongly
convex on $[0, +\infty)$ and has gradient $> 1$ on $\mathbb{R}$. To see intuitively why this should hold for
heavy-tailed marginals, note that for $G_b$ with fatter tails than a Gaussian, $|G_b^{-1}(\Phi_{0,\sigma^2}(x))|$ should
eventually dominate $|\Phi_{0,b^2}^{-1}(\Phi_{0,\sigma^2}(x))| = (b/\sigma)|x|$. Figure 1 demonstrates graphically that the
assumption holds for several choices of $G_b$, provided $b$ is large enough, i.e., that $g_b$ has sufficiently
heavy tails. Indeed, it can be shown that for scale parameters $b > 0$, the first and second derivatives
of $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$ scale linearly with $b$. Consider a measurement $0 \neq y \in \mathcal{Y}_{gp}$ with $\mathrm{sign}(y(x_d)) =
\mathrm{sign}(y(x_{d'}))$, $\forall d, d' \in D$. Analyzing such $y$ is relevant, as we are most interested in comparing
how multiple reinforcing observations at clustered locations and a single isolated observation are
absorbed during inference. By definition of $\mathcal{Y}_{gp}$, for $d^* = \arg\max_{d \in D} |y_d|$ we have $|y_{d^*}| < |y_s|$ as
long as $n > 2$. The corresponding element $y' = G_b^{-1}(\Phi_{0,\sigma^2}(y)) \in \mathcal{Y}_{hp}$ then satisfies



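$$|y'(x_s)| = \left|G_b^{-1}(\Phi_{0,\sigma^2}(y(x_s)))\right| > \left|\frac{G_b^{-1}(\Phi_{0,\sigma^2}(y(x_{d^*})))}{y(x_{d^*})}\right| |y(x_s)| = \left|\frac{y'(x_{d^*})}{y(x_{d^*})}\right| |y(x_s)|. \tag{8}$$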



Thus HPR inference leads to identical predictive distributions in $f$-space at the two locations even
though the isolated observation $y'(x_s)$ has disproportionately larger magnitude than $y'(x_{d^*})$, relative
to the GPR measurements $y(x_s)$ and $y(x_{d^*})$. As this statement holds for any $y \in \mathcal{Y}_{gp}$ satisfying
our earlier sign requirement, it indicates that HPR systematically shrinks isolated observations
more strongly than GPR. Moreover, since the second derivative of $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$ scales linearly
with $b > 0$, an intuitive connection suggests itself when looking at inequality (8): the heavier the
marginal tails, the stronger the inequality and thus the stronger the selective shrinkage.
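Inequality (8) is easy to probe numerically. The sketch below assumes Laplace marginals and $\sigma^2 = 1$, and uses an arbitrary same-sign measurement vector from $\mathcal{Y}_{gp}$ (dense values summing to the sparse value); the numbers and the names `warp` and `inequality_sides` are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm, laplace


def warp(y, sigma2, b):
    """y' = G_b^{-1}(Phi_{0,sigma2}(y)) with Laplace marginals G_b."""
    return laplace.ppf(norm.cdf(y, scale=np.sqrt(sigma2)), scale=b)


# A same-sign measurement in Y_gp: the sparse value is the sum of the dense ones.
y_dense = np.array([0.2, 0.5, 0.1, 0.4])
y_s = y_dense.sum()
d_star = int(np.argmax(np.abs(y_dense)))


def inequality_sides(b, sigma2=1.0):
    """Left- and right-hand sides of inequality (8) for scale parameter b."""
    yp_dense = warp(y_dense, sigma2, b)
    yp_s = warp(y_s, sigma2, b)
    lhs = abs(yp_s)
    rhs = abs(yp_dense[d_star] / y_dense[d_star]) * abs(y_s)
    return lhs, rhs


for b in (1.5, 2.0, 5.0):
    lhs, rhs = inequality_sides(b)
    print(f"b={b}: |y'(x_s)|={lhs:.3f} > {rhs:.3f}")
```

For Laplace marginals both sides grow linearly in $b$, so the absolute gap between them widens as the tails get heavier.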

The previous derivation exemplifies in an idealized setting that HPR leads to improved shrinkage 
of predictive distributions near isolated observations. More generally, because GPR transforms 
measurements only linearly, while HPR additionally pre-transforms measurements nonlinearly, 
our analysis suggests that for any GPR we can find an HPR model which leads to stronger selective
shrinkage. The result has intuitive parallels to the parametric case: just as $\ell_1$-regularization
improves shrinkage of parametric estimators, heavy-tailed processes improve shrinkage of nonparametric
estimators. Although our analysis kept $K(X,X)$ fixed for GPR and HPR, in practice we
are free to tune the kernel to yield a desired scale of predictive distributions. Lastly, although our 
analysis has been carried out for regression, it motivates us to explore heavy-tailed processes in 
the classification setting. 

4 Heavy-tailed process classification 

The derivation of our heavy-tailed process classifier (HPC) is similar to that of multiclass GPC 
with Laplace approximation in Section 3.5 of Rasmussen and Williams [12]. However, due to the
nonlinear transformations involved, some nice properties of their derivation are lost. We revert 
notation and let y denote a vector of class labels. For a C-class classification problem with n 
training points we introduce a vector of nC latent function measurements 

$$f = \left(f_1^1, \ldots, f_n^1, f_1^2, \ldots, f_n^2, \ldots, f_1^C, \ldots, f_n^C\right)^\top.$$

For each block $c \in \{1, \ldots, C\}$ of $n$ variables we define an independent heavy-tailed process prior
using the construction of Section 2 with a kernel matrix $K_c$. Equivalently, we can define the prior
jointly on $f$ by letting $K$ be a block-diagonal kernel matrix with blocks $K_1, \ldots, K_C$. Each kernel
matrix $K_c$ is defined by a (possibly different) symmetric positive semidefinite kernel with its own
set of parameters. The following construction relaxes the earlier condition that $\mathrm{diag}(K) = \sigma^2 \mathbf{1}$
and instead views $\Phi_{0,\sigma^2}(\cdot)$ as just some nonlinear transformation with parameter $\sigma^2$. By this
relaxation we effectively adopt Liu et al.'s [9] interpretation that this construction defines the
copula. The scale parameters $b$ could in principle vary across the $nC$ variables, but we keep them
constant at least within each block of $n$. Labels $y$ are represented in a 1-of-n form and generated
by the following observation model

$$p(y_i^c = 1 \mid f_i) = \pi_i^c = \frac{\exp\{f_i^c\}}{\sum_{c'} \exp\{f_i^{c'}\}}. \tag{9}$$

For inference we are ultimately interested in computing 

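$$p(y_*^c = 1 \mid X, y, x_*) = \mathbb{E}_{p(f_* \mid X, y, x_*)}\left( \frac{\exp\{f_*^c\}}{\sum_{c'} \exp\{f_*^{c'}\}} \right), \tag{10}$$

where $f_* = (f_*^1, \ldots, f_*^C)^\top$. The previous section motivates the hope that improved selective
shrinkage will occur in $p(f_* \mid X, y, x_*)$, provided the prior marginals have sufficiently heavy tails.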

4.1 Inference 

As in GPC, most of the intractability lies in computing the predictive distribution $p(f_* \mid X, y, x_*)$.
We use the Laplace approximation to address this issue: a Gaussian approximation to $p(z \mid X, y)$
is found and then combined with the Gaussian $p(z_* \mid X, z, x_*)$ to give us an approximation to
$p(z_* \mid X, y, x_*)$. This is then transformed to a (typically non-Gaussian) distribution in $f$-space
using a change of variables. Hence we first seek to find a mode and corresponding Hessian matrix
of the log posterior $\log p(z \mid X, y)$. Recalling the relation $f = G_b^{-1}(\Phi_{0,\sigma^2}(z))$, the log posterior can
be written as

$$J(z) = \log p(y \mid z) + \log p(z)$$
$$\phantom{J(z)} = y^\top f - \sum_i \log \sum_c \exp\{f_i^c\} - \frac{1}{2} z^\top K^{-1} z - \frac{1}{2} \log |K| + \text{const}.$$

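Let $\Pi$ be an $nC \times n$ matrix of stacked diagonal matrices $\mathrm{diag}(\pi^c)$ for $n$-subvectors $\pi^c$ of $\pi$. With
$W = \mathrm{diag}(\pi) - \Pi\Pi^\top$, the gradients are

$$\nabla J(z) = \mathrm{diag}\left(\frac{df}{dz}\right)(y - \pi) - K^{-1} z$$
$$\nabla\nabla J(z) = \mathrm{diag}\left(\frac{d^2 f}{dz^2}\right) \mathrm{diag}(y - \pi) - \mathrm{diag}\left(\frac{df}{dz}\right) W\, \mathrm{diag}\left(\frac{df}{dz}\right) - K^{-1}.$$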
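The objective $J(z)$ and its gradient can be sketched as follows. This is an illustration under several assumptions that go beyond the text: Laplace marginals, a squared-exponential kernel per class, class-major stacking of the one-hot labels, and a plain backtracking gradient ascent as one simple first-order way to search for the mode. The names `warp`, `objective` and `gradient_ascent` are hypothetical.

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.special import logsumexp, softmax
from scipy.stats import norm, laplace


def warp(z, sigma2, b):
    """f = G_b^{-1}(Phi_{0,sigma2}(z)) and df/dz for Laplace marginals."""
    s = np.sqrt(sigma2)
    f = laplace.ppf(norm.cdf(z, scale=s), scale=b)
    dfdz = norm.pdf(z, scale=s) / laplace.pdf(f, scale=b)  # chain rule
    return f, dfdz


def objective(z, y, K_inv, n, C, sigma2, b):
    """J(z) up to an additive constant, together with its gradient."""
    f, dfdz = warp(z, sigma2, b)
    F = f.reshape(C, n)                      # row c holds f_1^c, ..., f_n^c
    J = y @ f - logsumexp(F, axis=0).sum() - 0.5 * z @ K_inv @ z
    pi = softmax(F, axis=0).reshape(-1)      # softmax over classes, per point
    grad = dfdz * (y - pi) - K_inv @ z
    return J, grad


def gradient_ascent(z0, y, K_inv, n, C, sigma2, b, iters=200):
    """Backtracking gradient ascent; accepts a step only if J increases."""
    z = z0.copy()
    J, g = objective(z, y, K_inv, n, C, sigma2, b)
    for _ in range(iters):
        step = 1.0
        while step > 1e-10:
            J_new, g_new = objective(z + step * g, y, K_inv, n, C, sigma2, b)
            if J_new > J:
                z, J, g = z + step * g, J_new, g_new
                break
            step *= 0.5
        else:
            break                            # no improving step was found
    return z, J


n, C, sigma2, b = 4, 3, 1.0, 2.0
x = np.linspace(0.0, 1.0, n)
Kc = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2) + 1e-6 * np.eye(n)
K_inv = np.linalg.inv(block_diag(*[Kc] * C))
labels = np.array([0, 1, 2, 0])
y = np.eye(C)[labels].T.reshape(-1)          # one-hot labels, class-major order
z_hat, J_hat = gradient_ascent(np.zeros(n * C), y, K_inv, n, C, sigma2, b)
```

Accepting only steps that increase $J$ keeps the iteration monotone, which sidesteps the fact that $-\nabla\nabla J(z)$ need not be positive definite.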

Unlike in Rasmussen and Williams [12], $-\nabla\nabla J(z)$ is not generally positive definite owing to its
first term. For that reason we cannot use a Newton step to find the mode and instead resort to a
simpler gradient method. Once the mode $\hat{z}$ has been found we approximate the posterior as

$$p(z \mid X, y) \approx q(z \mid X, y) = \mathcal{N}\left(\hat{z},\, -\nabla\nabla J(\hat{z})^{-1}\right),$$

and use this to approximate the predictive distribution by

$$q(z_* \mid X, y, x_*) = \int p(z_* \mid X, z, x_*)\, q(z \mid X, y)\, dz.$$

Since we arranged for both distributions in the integral to be Gaussian, the resulting Gaussian can
be straightforwardly evaluated. Finally, to approximate the one-dimensional integral with respect
to $p(f_* \mid X, y, x_*)$ in Eq. (10) we could either use a quadrature method, or generate samples from
$q(z_* \mid X, y, x_*)$, convert them to $f$-space using $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$ and then approximate the expectation
by an average. We have compared predictions obtained using the latter method with those of
a Gibbs sampler; the Laplace approximation matched Gibbs results well, while costing only a
fraction of the time to compute.
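The sampling variant of this last step can be sketched as below: draw from the Gaussian $q(z_* \mid X, y, x_*)$, push the draws through $G_b^{-1}(\Phi_{0,\sigma^2}(\cdot))$, and average the softmax as in Eq. (10). Laplace marginals, the function name `predict_class_probs`, and the clipping of the copula values away from 0 and 1 (a numerical guard) are assumptions of this example.

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import norm, laplace


def predict_class_probs(mu_z, cov_z, sigma2, b, n_samples, rng):
    """Monte Carlo estimate of Eq. (10) from a Gaussian q(z_*) with C classes."""
    z = rng.multivariate_normal(mu_z, cov_z, size=n_samples)  # (n_samples, C)
    u = norm.cdf(z, scale=np.sqrt(sigma2))
    u = np.clip(u, 1e-12, 1.0 - 1e-12)       # keep the quantile function finite
    f = laplace.ppf(u, scale=b)              # samples in f-space
    return softmax(f, axis=1).mean(axis=0)   # average class probabilities


rng = np.random.default_rng(0)
p_flat = predict_class_probs(np.zeros(3), np.eye(3), 1.0, 2.0, 20000, rng)
p_peak = predict_class_probs(np.array([2.0, 0.0, 0.0]), np.eye(3),
                             1.0, 2.0, 20000, rng)
```

With a symmetric $q(z_*)$ centered at zero the estimate sits near the conservative value $1/C$, which is the target that selective shrinkage pulls predictions towards in sparse regions.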



4.2 Parameter estimation 

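Using a derivation similar to that in section 3.4.4 of [12], we have for $\hat{f} = G_b^{-1}(\Phi_{0,\sigma^2}(\hat{z}))$ that the
Laplace approximation of the marginal log likelihood is

$$\log p(y \mid X) \approx \log q(y \mid X) = J(\hat{z}) - \frac{1}{2} \log \left| -2\pi \nabla\nabla J(\hat{z}) \right| \tag{11}$$
$$= y^\top \hat{f} - \sum_i \log \sum_c \exp\{\hat{f}_i^c\} - \frac{1}{2} \hat{z}^\top K^{-1} \hat{z} - \frac{1}{2} \log |K| - \frac{1}{2} \log \left| -\nabla\nabla J(\hat{z}) \right| + \text{const}.$$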

We optimize kernel parameters $\theta$ by taking gradient steps on $\log q(y \mid X)$. The derivative needs
to take into account that perturbing the parameters can also perturb the mode $\hat{z}$ found for the
Laplace approximation. At an optimum $\nabla J(\hat{z})$ must be zero, so that

$$\hat{z} = K\, \mathrm{diag}\left(\frac{d\hat{f}}{d\hat{z}}\right)(y - \hat{\pi}), \tag{12}$$

where $\hat{\pi}$ is defined as in Eq. (9) but using $\hat{f}$ rather than $f$. Taking derivatives of this equation
allows us to compute the gradient $d\hat{z}/d\theta$. Differentiating the marginal likelihood we have

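$$\frac{d \log q(y \mid X)}{d\theta} = (y - \hat{\pi})^\top \mathrm{diag}\left(\frac{d\hat{f}}{d\hat{z}}\right) \frac{d\hat{z}}{d\theta} - \hat{z}^\top K^{-1} \frac{d\hat{z}}{d\theta} + \frac{1}{2} \hat{z}^\top K^{-1} \frac{dK}{d\theta} K^{-1} \hat{z}$$
$$\qquad - \frac{1}{2} \mathrm{tr}\left(K^{-1} \frac{dK}{d\theta}\right) - \frac{1}{2} \mathrm{tr}\left(\nabla\nabla J(\hat{z})^{-1} \frac{d \nabla\nabla J(\hat{z})}{d\theta}\right).$$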
Figure 2: (a) Schematic of a protein section. The backbone is the sequence of $C', N, C_\alpha, C', N$
atoms. An amino-acid-specific sidechain extends from the $C_\alpha$ atom at one of three discrete angles
known as "rotamers." (b) Ramachandran plot of 400 $(\Phi, \Psi)$ measurements and corresponding
rotamers (by shapes/colors) for amino-acid arg. The dark shading indicates the sparse region we
considered in producing results in Figure 3. Progressively lighter shadings indicate how the sparse
region was grown to produce Figure 4.



The remaining gradient computations are straightforward, albeit tedious. In addition to optimizing
the kernel parameters, it may also be of interest to optimize the scale parameter $b$ of marginals
$G_b$. Again, differentiating Eq. (12) with respect to $b$ allows us to compute $d\hat{z}/db$. We note that
when perturbing $b$ we change $\hat{f}$ by changing the underlying mode $\hat{z}$ as well as by changing the
parameter $b$ which is used to compute $\hat{f}$ from $\hat{z}$. Suppressing the detailed computations, the
derivative of the marginal log likelihood with respect to $b$ is

$$\frac{d \log q(y \mid X)}{db} = (y - \hat{\pi})^\top \frac{d\hat{f}}{db} - \left(\frac{d\hat{z}}{db}\right)^\top K^{-1} \hat{z} - \frac{1}{2} \mathrm{tr}\left(\nabla\nabla J(\hat{z})^{-1} \frac{d \nabla\nabla J(\hat{z})}{db}\right).$$



5 Experiments 

To a first approximation, the three-dimensional structure of a folded protein is defined by pairs
of continuous backbone angles $(\Phi, \Psi)$, one pair for each amino-acid, as well as discrete angles,
so-called rotamers, that define the conformations of the amino-acid sidechains that extend from
the backbone. The geometry is outlined in Figure 2(a). There is a strong dependence between
backbone angles $(\Phi, \Psi)$ and rotamer values; this is illustrated in the "Ramachandran plot" shown
in Figure 2(b), which plots the backbone angles for each rotamer (indicated by the shapes/colors).
The dependence is exploited in computational approaches to protein structure prediction, where
estimates of rotamer probabilities given backbone angles are used as one term in an energy function
that models native protein states as minima of the energy. Protein structures are predicted by
minimizing the energy function. Poor estimates of rotamer probabilities in sparse regions can
derail the prediction procedure. Indeed, sparsity has been a serious problem in state-of-the-art
rotamer models based on kernel density estimates (Roland Dunbrack, personal communication).
Unfortunately, we have found that GPC is not immune to the sparsity problem.

To evaluate our algorithm we consider rotamer-prediction tasks on the 17 amino acids (out of 20)
that have three rotamers at the first dihedral angle along the sidechain.² The development above
thus applies with the number of classes $C = 3$ and the covariates being $(\Phi, \Psi)$ angle pairs. Since the
input space is a torus we defined GPC and HPC using the following von Mises-inspired kernel³ for
$d$-dimensional angular data:

$$k(x_i, x_j) = \sigma^2 \exp\left\{ \lambda \left( \sum_{k=1}^{d} \cos(x_{i,k} - x_{j,k}) - d \right) \right\},$$

where $x_{i,k}, x_{j,k} \in [0, 2\pi]$ and $\sigma^2, \lambda > 0$. To find good GPC kernel parameters we optimize an
$\ell_2$-regularized version of the Laplace approximation to the log marginal likelihood reported in
Eq. 3.44 of [12]. For HPC we let $G_b$ be either the centered Laplace distribution or the hyperbolic
secant distribution with scale parameter $b$. We estimate HPC kernel parameters as well as $b$ by
similarly maximizing an $\ell_2$-regularized form of Eq. (11). In both cases we restricted the algorithms
to training sets of only 100 datapoints. Since good regularization parameters for the objectives are
not known a priori we train with and test them on a grid for each of the 17 rotameric residues in
ten-fold cross-validation. To find good regularization parameters for a particular residue we look
up that combination which, averaged over the ten folds of the remaining 16 residues, produced
the best test results. Having chosen the regularization constants we report average test results
computed in ten-fold cross-validation. We evaluate the algorithms on predefined sparse and dense
regions in the Ramachandran plot, as indicated by the background shading in Figure 2(b). Across
17 residues the sparse regions usually contained more than 70 measurements (and often more than
150), each of which appears in one of the 10 cross-validation folds. Figure 3 compares the label
prediction rates on the dense and sparse regions. Averaged over all 17 residues HPC outperforms
GPC by 5.79% with Laplace and 7.89% with hyperbolic secant marginals. With Laplace marginals
HPC underperforms GPC on only two residues in sparse regions: by 8.22% on gln, and by 2.53%
on his. On dense regions HPC lies within 0.5% on 16 residues and only degrades once by 3.64%
on his. Using hyperbolic secant marginals HPC often improves GPC by more than 10% on
sparse regions and degrades by more than 5% only on cys and his. On dense regions HPC
usually performs within 1.5% of GPC. In Figure 4 we show how the average rotamer prediction
rate across 17 residues changes as we grow the sparse region to include more measurements from
dense regions. The growth of the sparse region is indicated by progressively lighter shadings in
Figure 2(b). As more points are included the significant advantage of HPC lessens. Eventually
GPC does marginally better than HPC. The values reported in Figure 3 correspond to the dark
shaded region, which contains an average of 155 measurements per residue.

Figure 3: Rotamer prediction rates in percent in (a) sparse and (b) dense regions. Both flavors of
HPC (hyperbolic secant and Laplace marginals) significantly outperform GPC in sparse regions
while performing competitively in dense regions.

²Residues ala and gly are non-discrete while pro has only two rotamers at the first dihedral angle.



³The function $\cos(x_{i,k} - x_{j,k}) = [\cos(x_{i,k}), \sin(x_{i,k})][\cos(x_{j,k}), \sin(x_{j,k})]^\top$ is a symmetric positive
semi-definite kernel. By Propositions 3.22 (i) and (ii) and Proposition 3.25 in Shawe-Taylor and
Cristianini [14], so is $k(x_i, x_j)$ above.
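A direct transcription of the von Mises-inspired kernel, given here as a small sketch: the function name `von_mises_kernel` and the argument layout (rows are points, columns are the $d$ angles in radians) are choices made for this example.

```python
import numpy as np


def von_mises_kernel(Xa, Xb, sigma2, lam):
    """k(x_i, x_j) = sigma2 * exp(lam * (sum_k cos(x_ik - x_jk) - d))."""
    d = Xa.shape[1]
    diff = Xa[:, None, :] - Xb[None, :, :]    # pairwise angle differences
    return sigma2 * np.exp(lam * (np.cos(diff).sum(axis=-1) - d))


rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=(6, 2))   # e.g. six angle pairs
K = von_mises_kernel(angles, angles, sigma2=1.5, lam=2.0)
```

Subtracting $d$ inside the exponent normalizes the kernel so that $k(x, x) = \sigma^2$, and the cosine makes it periodic in every coordinate, as required on a torus.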
Figure 4: Average rotamer prediction rate in the sparse region for both flavors of HPC as well as
standard GPC as a function of the average number of points in the sparse region.

6 Related research 

Copulas [10] allow convenient modelling of multivariate correlation structures as separate from
marginal distributions. Early work by Song [16] used the Gaussian copula to generate complex
multivariate distributions by complementing a simple copula form with marginal distributions of 
choice. Popularity of the Gaussian copula in the financial literature is generally credited to Li [8]
who used it to model correlation structure for pairs of random variables with known marginals. 
More recently, the Gaussian process has been modified in a similar way to ours by Snelson et 
al. [15] who called the resulting stochastic process a Warped Gaussian Process. They demonstrate
that posterior distributions can better approximate the true noise distribution if the transforma- 
tion defining the warped process is learned. Jaimungal and Ng [7] have extended this work to 
model multiple parallel time series with marginally non-Gaussian stochastic processes under the 
name of Kernel-based Copula Processes (KCPs). Their work uses a "binding copula" to combine 
several subordinate copulas into a joint model. Bayesian approaches focusing on estimation of 
the Gaussian copula covariance matrix for a given dataset are given in [H [11] . With the advent 
of larger datasets, research has also focused on estimation in high-dimensional settings. Liu et 
al. [9] do away with a prior on covariance matrices and give consistency results for a covariance
estimator in high-dimensional settings. 

7 Conclusions 

This paper has analyzed learning scenarios where outliers are observed in the input space, rather
than the output space as commonly discussed in the literature. We illustrated heavy-tailed processes
as a straightforward extension of GPs and an elegant and economical way to improve the
robustness of estimators in sparse regions beyond those of GP-based methods. This was demonstrated
both by a theoretical analysis and experimental results. Importantly, because heavy-tailed
processes are based on a GP, they inherit many of its favorable computational properties; predictive
inference in regression, for instance, is straightforward. For approximate inference in more
complicated models utilizing heavy-tailed processes, this paper exemplifies that we can borrow
many ideas from standard GP models. Since heavy-tailed processes have a parsimonious representation,
they can be easily used as building blocks in more complicated models where currently
GPs are used. The benefits of heavy-tailed processes on selective shrinkage thus extend to many
other GP-based models that currently struggle with covariate shift.